2021-12-11

1. Introduction

Students: Berke ASLAN and William BOZZACCHI

Document content and description

This document is the final output for the ESILV, Year 5 Credit Scoring Project.

The project below is done with the programming language R and the document generated with R Markdown.

2. Libraries et datasets

2.1 Libraries

options(warn=-1)
library(tidyverse)
library(dplyr)
library(knitr)
library(skimr)
library(lubridate)
library(gridExtra)
library(scorecard)
library(ggplot2)
library(ggplotify)
library(plotly)
library(questionr)
library(caret)
library(pROC)
library(purrr)
library(caTools)

2.2 Datasets

Now let’s import the dataset and check their given formats.

# Import dataset from the project directory
raw_data <- read_csv("TD4Data.csv")
data_dict <- read_csv("data_dictionary.csv")

# Split dataset into train/test
set.seed(101)
train <- raw_data %>% sample_frac(.8)
test <- anti_join(raw_data, train, by='UniqueID')

# Summarize the datasets
summary(train)
##     UniqueID      disbursed_amount   asset_cost          ltv       
##  Min.   :417614   Min.   : 15579   Min.   : 39684   Min.   :25.72  
##  1st Qu.:499762   1st Qu.: 48182   1st Qu.: 66279   1st Qu.:69.83  
##  Median :569888   Median : 54513   Median : 70802   Median :77.42  
##  Mean   :554380   Mean   : 55468   Mean   : 76608   Mean   :75.43  
##  3rd Qu.:610596   3rd Qu.: 60959   3rd Qu.: 78905   3rd Qu.:84.38  
##  Max.   :653991   Max.   :153318   Max.   :215528   Max.   :94.95  
##  NA's   :1        NA's   :1        NA's   :1                       
##    branch_id       supplier_id    manufacturer_id  Current_pincode_ID
##  Min.   :  1.00   Min.   :12797   Min.   : 45.00   Min.   :   5      
##  1st Qu.: 13.00   1st Qu.:16264   1st Qu.: 48.00   1st Qu.:1484      
##  Median : 61.00   Median :18709   Median : 86.00   Median :2783      
##  Mean   : 70.17   Mean   :19499   Mean   : 69.25   Mean   :3216      
##  3rd Qu.:120.00   3rd Qu.:22991   3rd Qu.: 86.00   3rd Qu.:5232      
##  Max.   :260.00   Max.   :24793   Max.   :120.00   Max.   :7302      
##                   NA's   :1                        NA's   :1         
##  Date.of.Birth      Employment.Type    DisbursalDate         State_ID     
##  Length:1311        Length:1311        Length:1311        Min.   : 1.000  
##  Class :character   Class :character   Class :character   1st Qu.: 4.000  
##  Mode  :character   Mode  :character   Mode  :character   Median : 6.000  
##                                                           Mean   : 6.905  
##                                                           3rd Qu.: 9.000  
##                                                           Max.   :21.000  
##                                                                           
##  Employee_code_ID MobileNo_Avl_Flag  Aadhar_flag        PAN_flag      
##  Min.   :   3.0   Min.   :1         Min.   :0.0000   Min.   :0.00000  
##  1st Qu.: 716.2   1st Qu.:1         1st Qu.:1.0000   1st Qu.:0.00000  
##  Median :1402.5   Median :1         Median :1.0000   Median :0.00000  
##  Mean   :1536.6   Mean   :1         Mean   :0.8375   Mean   :0.07933  
##  3rd Qu.:2286.5   3rd Qu.:1         3rd Qu.:1.0000   3rd Qu.:0.00000  
##  Max.   :3774.0   Max.   :1         Max.   :1.0000   Max.   :1.00000  
##  NA's   :1                                                            
##   VoterID_flag     Driving_flag     Passport_flag      PERFORM_CNS.SCORE
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.000000   Min.   :  0.0    
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:  0.0    
##  Median :0.0000   Median :0.00000   Median :0.000000   Median :563.0    
##  Mean   :0.1426   Mean   :0.02822   Mean   :0.003051   Mean   :415.5    
##  3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:737.0    
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.000000   Max.   :853.0    
##                                                                         
##  PERFORM_CNS.SCORE.DESCRIPTION PRI.NO.OF.ACCTS   PRI.ACTIVE.ACCTS
##  Length:1311                   Min.   :  0.000   Min.   : 0.00   
##  Class :character              1st Qu.:  0.000   1st Qu.: 0.00   
##  Mode  :character              Median :  2.000   Median : 1.00   
##                                Mean   :  3.577   Mean   : 1.57   
##                                3rd Qu.:  4.000   3rd Qu.: 2.00   
##                                Max.   :104.000   Max.   :24.00   
##                                                                  
##  PRI.OVERDUE.ACCTS PRI.CURRENT.BALANCE PRI.SANCTIONED.AMOUNT
##  Min.   :0.0000    Min.   :   -3532    Min.   :       0     
##  1st Qu.:0.0000    1st Qu.:       0    1st Qu.:       0     
##  Median :0.0000    Median :    8164    Median :   23820     
##  Mean   :0.2159    Mean   :  217109    Mean   :  270625     
##  3rd Qu.:0.0000    3rd Qu.:   93738    3rd Qu.:  150000     
##  Max.   :8.0000    Max.   :10091566    Max.   :10000000     
##                    NA's   :1           NA's   :1            
##  PRI.DISBURSED.AMOUNT PRIMARY.INSTAL.AMT NEW.ACCTS.IN.LAST.SIX.MONTHS
##  Min.   :       0     Min.   :      0    Min.   : 0.0000             
##  1st Qu.:       0     1st Qu.:      0    1st Qu.: 0.0000             
##  Median :   23056     Median :      0    Median : 0.0000             
##  Mean   :  271799     Mean   :  14506    Mean   : 0.5728             
##  3rd Qu.:  150000     3rd Qu.:   4750    3rd Qu.: 1.0000             
##  Max.   :10000000     Max.   :4189968    Max.   :13.0000             
##  NA's   :1                                                           
##  DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS AVERAGE.ACCT.AGE   CREDIT.HISTORY.LENGTH
##  Min.   :0.0000                      Length:1311        Length:1311          
##  1st Qu.:0.0000                      Class :character   Class :character     
##  Median :0.0000                      Mode  :character   Mode  :character     
##  Mean   :0.1373                                                              
##  3rd Qu.:0.0000                                                              
##  Max.   :7.0000                                                              
##                                                                              
##  NO.OF_INQUIRIES    loan_default   
##  Min.   : 0.0000   Min.   :0.0000  
##  1st Qu.: 0.0000   1st Qu.:0.0000  
##  Median : 0.0000   Median :0.0000  
##  Mean   : 0.2906   Mean   :0.2586  
##  3rd Qu.: 0.0000   3rd Qu.:1.0000  
##  Max.   :13.0000   Max.   :1.0000  
## 
summary(data_dict)
##  Variable Name      Description       
##  Length:41          Length:41         
##  Class :character   Class :character  
##  Mode  :character   Mode  :character

3. Le snorkeling

3.0.1 Chargement des fichiers et premières impressions

Let’s use the library skim to work with our grouped dataset.

skim(train)
Data summary
Name train
Number of rows 1311
Number of columns 34
_______________________
Column type frequency:
character 6
numeric 28
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Date.of.Birth 0 1.00 10 10 0 996 0
Employment.Type 37 0.97 8 13 0 2 0
DisbursalDate 0 1.00 10 10 0 83 0
PERFORM_CNS.SCORE.DESCRIPTION 0 1.00 10 55 0 19 0
AVERAGE.ACCT.AGE 0 1.00 9 11 0 82 0
CREDIT.HISTORY.LENGTH 0 1.00 9 11 0 130 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
UniqueID 1 1 554379.60 68099.09 417614.00 499762.00 569888.00 610596.50 653991.00 ▅▃▅▇▇
disbursed_amount 1 1 55467.52 13854.46 15579.00 48182.00 54513.00 60959.00 153318.00 ▂▇▁▁▁
asset_cost 1 1 76608.23 20066.28 39684.00 66279.25 70802.00 78905.25 215528.00 ▇▃▁▁▁
ltv 0 1 75.43 11.61 25.72 69.83 77.42 84.38 94.95 ▁▁▂▇▆
branch_id 0 1 70.17 68.48 1.00 13.00 61.00 120.00 260.00 ▇▅▃▁▁
supplier_id 1 1 19499.23 3541.37 12797.00 16264.00 18709.00 22991.25 24793.00 ▃▅▃▃▇
manufacturer_id 0 1 69.25 21.78 45.00 48.00 86.00 86.00 120.00 ▇▁▇▁▁
Current_pincode_ID 1 1 3216.27 2163.62 5.00 1484.00 2783.00 5232.50 7302.00 ▇▇▅▃▆
State_ID 0 1 6.91 4.35 1.00 4.00 6.00 9.00 21.00 ▇▅▂▁▁
Employee_code_ID 1 1 1536.63 978.28 3.00 716.25 1402.50 2286.50 3774.00 ▇▇▆▅▃
MobileNo_Avl_Flag 0 1 1.00 0.00 1.00 1.00 1.00 1.00 1.00 ▁▁▇▁▁
Aadhar_flag 0 1 0.84 0.37 0.00 1.00 1.00 1.00 1.00 ▂▁▁▁▇
PAN_flag 0 1 0.08 0.27 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
VoterID_flag 0 1 0.14 0.35 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
Driving_flag 0 1 0.03 0.17 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
Passport_flag 0 1 0.00 0.06 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
PERFORM_CNS.SCORE 0 1 415.52 335.78 0.00 0.00 563.00 737.00 853.00 ▇▁▁▅▇
PRI.NO.OF.ACCTS 0 1 3.58 5.92 0.00 0.00 2.00 4.00 104.00 ▇▁▁▁▁
PRI.ACTIVE.ACCTS 0 1 1.57 2.33 0.00 0.00 1.00 2.00 24.00 ▇▁▁▁▁
PRI.OVERDUE.ACCTS 0 1 0.22 0.61 0.00 0.00 0.00 0.00 8.00 ▇▁▁▁▁
PRI.CURRENT.BALANCE 1 1 217109.47 728115.00 -3532.00 0.00 8164.50 93737.50 10091566.00 ▇▁▁▁▁
PRI.SANCTIONED.AMOUNT 1 1 270625.16 788387.60 0.00 0.00 23819.50 150000.00 10000000.00 ▇▁▁▁▁
PRI.DISBURSED.AMOUNT 1 1 271799.44 809182.76 0.00 0.00 23055.50 150000.00 10000000.00 ▇▁▁▁▁
PRIMARY.INSTAL.AMT 0 1 14506.10 144638.06 0.00 0.00 0.00 4750.00 4189968.00 ▇▁▁▁▁
NEW.ACCTS.IN.LAST.SIX.MONTHS 0 1 0.57 1.14 0.00 0.00 0.00 1.00 13.00 ▇▁▁▁▁
DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS 0 1 0.14 0.45 0.00 0.00 0.00 0.00 7.00 ▇▁▁▁▁
NO.OF_INQUIRIES 0 1 0.29 0.90 0.00 0.00 0.00 0.00 13.00 ▇▁▁▁▁
loan_default 0 1 0.26 0.44 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▃

Choissons quelques variables pour voir la structure de notre dataset.

options(tibble.width = Inf) # displays all columns.

# Check specific data structures
train %>% dplyr::select(ltv, 
                           disbursed_amount,
                           "Date.of.Birth",
                           supplier_id,
                           "Employment.Type",
                           "PERFORM_CNS.SCORE.DESCRIPTION",
                           loan_default,
                           "AVERAGE.ACCT.AGE",
                           "CREDIT.HISTORY.LENGTH") %>%  head()
## # A tibble: 6 × 9
##     ltv disbursed_amount Date.of.Birth supplier_id Employment.Type
##   <dbl>            <dbl> <chr>               <dbl> <chr>          
## 1  77.2            51578 25/06/1982          15899 Salaried       
## 2  60.4            36975 20/08/1966          17916 Self employed  
## 3  89.4            64482 25/10/1995          22708 Salaried       
## 4  69.6           116908 01/06/1981          23002 Self employed  
## 5  59.9            41381 23/07/1985          23483 Self employed  
## 6  65.0            72717 09/11/1988          23150 Self employed  
##   PERFORM_CNS.SCORE.DESCRIPTION                loan_default AVERAGE.ACCT.AGE
##   <chr>                                               <dbl> <chr>           
## 1 B-Very Low Risk                                         0 4yrs 7mon       
## 2 H-Medium Risk                                           0 1yrs 3mon       
## 3 A-Very Low Risk                                         0 1yrs 1mon       
## 4 Not Scored: Only a Guarantor                            0 3yrs 10mon      
## 5 C-Very Low Risk                                         1 0yrs 11mon      
## 6 Not Scored: Sufficient History Not Available            1 0yrs 0mon       
##   CREDIT.HISTORY.LENGTH
##   <chr>                
## 1 4yrs 7mon            
## 2 3yrs 0mon            
## 3 2yrs 6mon            
## 4 3yrs 10mon           
## 5 1yrs 8mon            
## 6 0yrs 0mon

3.0.2 Echange avec les métiers

Après cette première phase de snorkeling, nous avons pu constater la structure du dataset et les différentes variables. Cette première étape nous permet d’avoir une première idée des variables pertinentes. Nous avons aussi fait des changements sur le type des variables pour retyper les variables en factor pour les modalités.


4. Deep Dive

4.1 Wrangling et mise en forme

Dans cette phase, nous avons décidé de modifier la structure de la variable PERFORM_CNS.SCORE.DESCRIPTION. En effet, cette variable donne une description du score du bureau et présente 19 modalités différentes. Nous avons regroupé les modalités en quatre modalités plus larges de façon à rendre le calcul de la régression logistique plus efficace. Les quatre nouvelles modalités sont les suivantes : High, Low, Medium et Not Scored.

Nous avons ensuite modifié les variables AVERAGE.ACCT.AGE et CREDIT.HISTORY.LENGTH représentant respectivement la durée moyenne des prêts et le temps écoulé depuis le premier prêt. Nous les avons convertis en nombre de mois à la place du format Xyrs Ymon. Cette modification nous permet d’avoir des variables numeric pour étudier leur impact.

De même pour les variables Date.of.Birth et DisbursalDate. Nous les avons respectivement convertis en BorrowerAge (Age de l’emprunteur) et NbrMonthRelation (Nombre de mois pour le décaissement) à la place du format Day/Month/Year.

En plottant les différentes variables, nous nous rendons compte que certaines variables ne sont pas pertinentes car elles n’apportent pas d’informations sur l’individu. C’est le cas des variables Passport.flag et MobileNo_Avl_Flag. En effet, l’ensemble des individus possède la même valeur pour ces deux variables ce qui les rend useless.

# Factor the drivers
wrangled_train <- train %>% mutate(
  Employment.Type =replace_na(train$Employment.Type,"None"),
  State_ID = as.factor(State_ID),
  loan_default = as.factor(loan_default),
  VoterID_flag = as.factor(VoterID_flag),
  supplier_id = as.factor(supplier_id),
  manufacturer_id = as.factor(manufacturer_id),
  Current_pincode_ID = as.factor(Current_pincode_ID),
  MobileNo_Avl_Flag = as.factor(MobileNo_Avl_Flag),
  Driving_flag = as.factor(Driving_flag),
  Passport_flag = as.factor(Passport_flag)
)

# Factor the drivers
wrangled_test <- test %>% mutate(
  Employment.Type =replace_na(test$Employment.Type,"None"),
  State_ID = as.factor(State_ID),
  loan_default = as.factor(loan_default),
  VoterID_flag = as.factor(VoterID_flag),
  supplier_id = as.factor(supplier_id),
  manufacturer_id = as.factor(manufacturer_id),
  Current_pincode_ID = as.factor(Current_pincode_ID),
  MobileNo_Avl_Flag = as.factor(MobileNo_Avl_Flag),
  Driving_flag = as.factor(Driving_flag),
  Passport_flag = as.factor(Passport_flag)
)

On calcule les pourcentages en lignes avec la fonction lprop:

# Check data occurences
wrangled_train %>% count(PERFORM_CNS.SCORE.DESCRIPTION)
## # A tibble: 19 × 2
##    PERFORM_CNS.SCORE.DESCRIPTION                               n
##    <chr>                                                   <int>
##  1 A-Very Low Risk                                           123
##  2 B-Very Low Risk                                            71
##  3 C-Very Low Risk                                           138
##  4 D-Very Low Risk                                            85
##  5 E-Low Risk                                                 38
##  6 F-Low Risk                                                 67
##  7 G-Low Risk                                                 30
##  8 H-Medium Risk                                              61
##  9 I-Medium Risk                                             101
## 10 J-High Risk                                                36
## 11 K-High Risk                                                67
## 12 L-Very High Risk                                           11
## 13 M-Very High Risk                                           68
## 14 No Bureau History Available                               305
## 15 Not Scored: No Activity seen on the customer (Inactive)    34
## 16 Not Scored: No Updates available in last 36 months          9
## 17 Not Scored: Not Enough Info available on the customer      34
## 18 Not Scored: Only a Guarantor                               10
## 19 Not Scored: Sufficient History Not Available               23
wrangled_train <- wrangled_train %>%
  mutate(credit_category = case_when(
    startsWith(PERFORM_CNS.SCORE.DESCRIPTION, "M-") ~ "High",
    startsWith(PERFORM_CNS.SCORE.DESCRIPTION, "F-") ~ "Low",
    startsWith(PERFORM_CNS.SCORE.DESCRIPTION, "D-") ~ "Low",
    startsWith(PERFORM_CNS.SCORE.DESCRIPTION, "K-") ~ "High",
    startsWith(PERFORM_CNS.SCORE.DESCRIPTION, "B-") ~ "Low",
    startsWith(PERFORM_CNS.SCORE.DESCRIPTION, "G-") ~ "Low",
    startsWith(PERFORM_CNS.SCORE.DESCRIPTION, "I-") ~ "Medium",
    startsWith(PERFORM_CNS.SCORE.DESCRIPTION, "Not Scored") ~ "Not Scored",
    startsWith(PERFORM_CNS.SCORE.DESCRIPTION, "A-") ~ "Low",
    startsWith(PERFORM_CNS.SCORE.DESCRIPTION, "C-") ~ "Low",
    startsWith(PERFORM_CNS.SCORE.DESCRIPTION, "No Bureau") ~ "Not Scored",
    startsWith(PERFORM_CNS.SCORE.DESCRIPTION, "E-") ~ "Low",
    startsWith(PERFORM_CNS.SCORE.DESCRIPTION, "H-") ~ "Medium",
    startsWith(PERFORM_CNS.SCORE.DESCRIPTION, "J-") ~ "High",
    startsWith(PERFORM_CNS.SCORE.DESCRIPTION, "L-") ~ "High",
  ))

# Utilize lprop function to calculate percentages
tab <- table(wrangled_train$PERFORM_CNS.SCORE.DESCRIPTION,wrangled_train$loan_default)
lprop(tab)
##                                                          
##                                                           0     1     Total
##   A-Very Low Risk                                          90.2   9.8 100.0
##   B-Very Low Risk                                          90.1   9.9 100.0
##   C-Very Low Risk                                          82.6  17.4 100.0
##   D-Very Low Risk                                          77.6  22.4 100.0
##   E-Low Risk                                               92.1   7.9 100.0
##   F-Low Risk                                               79.1  20.9 100.0
##   G-Low Risk                                               86.7  13.3 100.0
##   H-Medium Risk                                            63.9  36.1 100.0
##   I-Medium Risk                                            72.3  27.7 100.0
##   J-High Risk                                              50.0  50.0 100.0
##   K-High Risk                                              46.3  53.7 100.0
##   L-Very High Risk                                         27.3  72.7 100.0
##   M-Very High Risk                                         39.7  60.3 100.0
##   No Bureau History Available                              76.1  23.9 100.0
##   Not Scored: No Activity seen on the customer (Inactive)  76.5  23.5 100.0
##   Not Scored: No Updates available in last 36 months       88.9  11.1 100.0
##   Not Scored: Not Enough Info available on the customer    76.5  23.5 100.0
##   Not Scored: Only a Guarantor                             50.0  50.0 100.0
##   Not Scored: Sufficient History Not Available             65.2  34.8 100.0
##   All                                                      74.1  25.9 100.0
# Utilize lprop function to calculate percentages
tab_prime <- table(wrangled_train$credit_category,wrangled_train$loan_default)
lprop(tab_prime)
##             
##              0     1     Total
##   High        43.4  56.6 100.0
##   Low         85.0  15.0 100.0
##   Medium      69.1  30.9 100.0
##   Not Scored  75.2  24.8 100.0
##   All         74.1  25.9 100.0

Parsing utilisant le regex.

# Calculate yrs mon parse to months
wrangled_train <- wrangled_train %>%
  mutate(AAA = 12 * as.numeric(str_extract(AVERAGE.ACCT.AGE, pattern="(\\d)+(?=yrs)")) + as.numeric(str_extract(AVERAGE.ACCT.AGE, pattern="(\\d)+(?=mon)")),
         CHL = 12 * as.numeric(str_extract(CREDIT.HISTORY.LENGTH, pattern="(\\d)+(?=yrs)")) + as.numeric(str_extract(CREDIT.HISTORY.LENGTH, pattern="(\\d)+(?=mon)")))

wrangled_train %>% select(AVERAGE.ACCT.AGE,
                AAA,
                CREDIT.HISTORY.LENGTH,
                CHL)
## # A tibble: 1,311 × 4
##    AVERAGE.ACCT.AGE   AAA CREDIT.HISTORY.LENGTH   CHL
##    <chr>            <dbl> <chr>                 <dbl>
##  1 4yrs 7mon           55 4yrs 7mon                55
##  2 1yrs 3mon           15 3yrs 0mon                36
##  3 1yrs 1mon           13 2yrs 6mon                30
##  4 3yrs 10mon          46 3yrs 10mon               46
##  5 0yrs 11mon          11 1yrs 8mon                20
##  6 0yrs 0mon            0 0yrs 0mon                 0
##  7 0yrs 0mon            0 0yrs 0mon                 0
##  8 0yrs 0mon            0 0yrs 2mon                 2
##  9 2yrs 3mon           27 5yrs 9mon                69
## 10 0yrs 0mon            0 0yrs 0mon                 0
## # … with 1,301 more rows
# Calculate yrs mon parse to months
wrangled_test <- wrangled_test %>%
  mutate(AAA = 12 * as.numeric(str_extract(AVERAGE.ACCT.AGE, pattern="(\\d)+(?=yrs)")) + as.numeric(str_extract(AVERAGE.ACCT.AGE, pattern="(\\d)+(?=mon)")),
         CHL = 12 * as.numeric(str_extract(CREDIT.HISTORY.LENGTH, pattern="(\\d)+(?=yrs)")) + as.numeric(str_extract(CREDIT.HISTORY.LENGTH, pattern="(\\d)+(?=mon)")))

On calcule maintenant l’age de borrower et le nombre de mois en relation avec le disbursal date.

# Insert function to calculate age
# Transform character columns to date
wrangled_train$Date.of.Birth <- dmy(wrangled_train$Date.of.Birth)
wrangled_train$DisbursalDate <- dmy(wrangled_train$DisbursalDate)

# diff_in_days = difftime(datetimes[2], datetimes[1], units = "days")

wrangled_train <- wrangled_train %>% mutate(
  BorrowerAge = MESS::age(Date.of.Birth, today()),
  NbrMonthRelation = -1 * as.numeric(difftime(DisbursalDate, today(), units = "days") / 30)
)

wrangled_train %>% select(Date.of.Birth,
                BorrowerAge,
                DisbursalDate,
                NbrMonthRelation)
## # A tibble: 1,311 × 4
##    Date.of.Birth BorrowerAge DisbursalDate NbrMonthRelation
##    <date>              <dbl> <date>                   <dbl>
##  1 1982-06-25             39 2018-10-13                38.5
##  2 1966-08-20             55 2018-10-26                38.1
##  3 1995-10-25             26 2018-10-09                38.6
##  4 1981-06-01             40 2018-10-27                38.0
##  5 1985-07-23             36 2018-09-28                39  
##  6 1988-11-09             33 2018-09-18                39.3
##  7 1997-01-01             24 2018-10-27                38.0
##  8 1960-04-15             61 2018-10-11                38.6
##  9 1983-12-22             37 2018-10-30                37.9
## 10 1976-01-01             45 2018-10-10                38.6
## # … with 1,301 more rows
# Insert function to calculate age
# Transform character columns to date
wrangled_test$Date.of.Birth <- dmy(wrangled_test$Date.of.Birth)
wrangled_test$DisbursalDate <- dmy(wrangled_test$DisbursalDate)

# diff_in_days = difftime(datetimes[2], datetimes[1], units = "days")

wrangled_test <- wrangled_test %>% mutate(
  BorrowerAge = MESS::age(Date.of.Birth, today()),
  NbrMonthRelation = -1 * as.numeric(difftime(DisbursalDate, today(), units = "days") / 30)
)
library(gridExtra)

# Grid plot ggplot

p1 <- ggplot(wrangled_train, aes(x=AAA)) + geom_histogram()
p2 <- ggplot(wrangled_train, aes(x=Aadhar_flag)) + geom_histogram()
p3 <- ggplot(wrangled_train, aes(x=asset_cost)) + geom_histogram()
p4 <- ggplot(wrangled_train, aes(x=BorrowerAge)) + geom_histogram()
p5 <- ggplot(wrangled_train, aes(x=branch_id)) + geom_histogram()
p6 <- ggplot(wrangled_train, aes(x=CHL)) + geom_histogram()
p7 <- ggplot(wrangled_train, aes(x=DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS)) + geom_histogram()
p8 <- ggplot(wrangled_train, aes(x=disbursed_amount)) + geom_histogram()
p9 <- ggplot(wrangled_train, aes(x=Employee_code_ID)) + geom_histogram()
p10 <- ggplot(wrangled_train, aes(x=ltv)) + geom_histogram()
p11 <- ggplot(wrangled_train, aes(x=NbrMonthRelation)) + geom_histogram()
p12 <- ggplot(wrangled_train, aes(x=NEW.ACCTS.IN.LAST.SIX.MONTHS)) + geom_histogram()
p13 <- ggplot(wrangled_train, aes(x=NO.OF_INQUIRIES)) + geom_histogram()
p14 <- ggplot(wrangled_train, aes(x=PAN_flag)) + geom_histogram()
p15 <- ggplot(wrangled_train, aes(x=PERFORM_CNS.SCORE)) + geom_histogram()
p16 <- ggplot(wrangled_train, aes(x=PRI.ACTIVE.ACCTS)) + geom_histogram()
p17 <- ggplot(wrangled_train, aes(x=PRI.CURRENT.BALANCE)) + geom_histogram()
p18 <- ggplot(wrangled_train, aes(x=PRI.DISBURSED.AMOUNT)) + geom_histogram()
p19 <- ggplot(wrangled_train, aes(x=PRI.NO.OF.ACCTS)) + geom_histogram()
p20 <- ggplot(wrangled_train, aes(x=PRI.OVERDUE.ACCTS)) + geom_histogram()
p21 <- ggplot(wrangled_train, aes(x=PRI.SANCTIONED.AMOUNT)) + geom_histogram()
p22 <- ggplot(wrangled_train, aes(x=PRIMARY.INSTAL.AMT)) + geom_histogram()
p23 <- ggplot(wrangled_train, aes(x=UniqueID)) + geom_histogram()

grid.arrange(p1, p2, p3, p4,
             p5, p6, p7, p8,
             p9, p10, p11, p12,
             p13, p14, p15, p16, 
             p17, p18, p19, p20,
             p21, p22, p23, ncol=3, nrow = 8)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Categoriel.

p1 <- ggplot(wrangled_train, aes(x=Current_pincode_ID)) + geom_histogram(stat="count")
p2 <- ggplot(wrangled_train, aes(x=Driving_flag)) + geom_histogram(stat="count")
p3 <- ggplot(wrangled_train, aes(x=loan_default)) + geom_histogram(stat="count")
p4 <- ggplot(wrangled_train, aes(x=manufacturer_id)) + geom_histogram(stat="count")
p5 <- ggplot(wrangled_train, aes(x=MobileNo_Avl_Flag)) + geom_histogram(stat="count")
p6 <- ggplot(wrangled_train, aes(x=Passport_flag)) + geom_histogram(stat="count")
p7 <- ggplot(wrangled_train, aes(x=State_ID)) + geom_histogram(stat="count")
p8 <- ggplot(wrangled_train, aes(x=supplier_id)) + geom_histogram(stat="count")
p9 <- ggplot(wrangled_train, aes(x=VoterID_flag)) + geom_histogram(stat="count")


grid.arrange(p1, p2, p3, p4,
             p5, p6, p7, p8,
             p9, ncol=3, nrow = 3)

4.3 Sélection des variables pour le modèle

4.3.1 Analyses univariées & bivariées

En réalisant le boxplot de la variable BorrowerAge, nous nous rendons compte que la saisie d’une des individus est fausse. En effet, le boxplot nous indique qu’un individu a 135 ans ce qui est impossible. On observe aussi un outlier, un individu en défaut ayant 92 ans.

boxplot(wrangled_train$BorrowerAge,main="Age de le borrower",
        xlab="L ecart inter-quartile h", ylab="Heures",horizontal = FALSE)

Boxplot + violon plot.

p1 <- ggplot(wrangled_train, aes(x=wrangled_train$BorrowerAge)) + geom_boxplot() + xlab("BorrowerAge") + ylab("loan_default")

p2 <- ggplot(wrangled_train, aes(x=wrangled_train$BorrowerAge, y=wrangled_train$loan_default)) + geom_boxplot() + xlab("BorrowerAge") + ylab("loan_default")

p3 <- ggplot(wrangled_train, aes(x=wrangled_train$BorrowerAge, y=wrangled_train$loan_default))  + geom_violin()  + geom_boxplot(width=.1) + xlab("BorrowerAge") + ylab("loan_default")

grid.arrange(p1, p2, p3, ncol=2, nrow = 2)

4.3.2 Discrétisation des variables

Dans cette section nous discrétisons les variables afin de réaliser notre régression logistique. Cette étape présente une perte d’informations mais elle est nécessaire pour le calcul de la régression logistique dans la mesure où un nombre trop important de valeurs rend le calcul impossible.

Les différents graphes nous donnent des informations pertinentes sur l’impact des différentes variables dans notre modèle.

Par exemple, le graphe de la variable ltv (Rapport prêt/valeur de l’actif) nous indique qu’il s’agit d’une variable croissante du défaut. La variable ltv a été discrétiser en trois classes [0 ; 55], [55, 85] et [85 ; inf]. Le pourcentage en bleu sur la courbe représente le rapport entre la proportion de rouge (bas) sur la population entière. On observe que plus la ltv est importante et donc plus le prêt est important par rapport à la valeur de l’actif plus l’individu a des chances de faire défaut.

Le graphe de la PERFORM_CNS.SCORE.DESCRIPTION nous montre aussi qu’il s’agit d’une variable croissante du défaut. En effet, le ratio d’individu bad sur l’ensemble de la population augmente en fonction de la notation du bureau (on observe 56% de pour le score high ce qui signifie que plus d’un individu sur 2 fait défaut dans cette catégorie).

to_plot_cat <- c("ltv", "disbursed_amount", "PRI.CURRENT.BALANCE",
                   "PRI.DISBURSED.AMOUNT", "PRIMARY.INSTAL.AMT", "BorrowerAge",
                   "AAA", "PERFORM_CNS.SCORE.DESCRIPTION", "PRI.OVERDUE.ACCTS",
                   "DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS")

for(i in to_plot_cat)
{
bins = woebin(wrangled_train, y="loan_default", x=i,bin_num_limit=3)
p1 = woebin_plot(bins)
print(p1)
}
## [INFO] creating woe binning ... 
## $ltv

## 
## [INFO] creating woe binning ... 
## $disbursed_amount

## 
## [INFO] creating woe binning ... 
## $PRI.CURRENT.BALANCE

## 
## [INFO] creating woe binning ... 
## $PRI.DISBURSED.AMOUNT

## 
## [INFO] creating woe binning ... 
## $PRIMARY.INSTAL.AMT

## 
## [INFO] creating woe binning ... 
## $BorrowerAge

## 
## [INFO] creating woe binning ... 
## $AAA

## 
## [INFO] creating woe binning ... 
## $PERFORM_CNS.SCORE.DESCRIPTION

## 
## [INFO] creating woe binning ... 
## $PRI.OVERDUE.ACCTS

## 
## [INFO] creating woe binning ... 
## $DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS

4.3.3 Pouvoir discriminant des variables

iv = iv(wrangled_train, y = 'loan_default') %>%
  as_tibble() %>%
  mutate( info_value = round(info_value, 3) ) %>%
  arrange( desc(info_value) )

iv %>%
  knitr::kable()
variable info_value
PERFORM_CNS.SCORE 0.573
PERFORM_CNS.SCORE.DESCRIPTION 0.564
credit_category 0.464
branch_id 0.374
PRI.DISBURSED.AMOUNT 0.361
PRI.SANCTIONED.AMOUNT 0.358
PRIMARY.INSTAL.AMT 0.353
CREDIT.HISTORY.LENGTH 0.326
CHL 0.326
DisbursalDate 0.317
NbrMonthRelation 0.317
supplier_id 0.316
PRI.CURRENT.BALANCE 0.313
disbursed_amount 0.240
PRI.OVERDUE.ACCTS 0.234
AVERAGE.ACCT.AGE 0.222
AAA 0.222
DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS 0.199
BorrowerAge 0.194
Current_pincode_ID 0.181
Date.of.Birth 0.155
Employee_code_ID 0.150
PRI.NO.OF.ACCTS 0.120
ltv 0.112
PRI.ACTIVE.ACCTS 0.100
State_ID 0.083
NO.OF_INQUIRIES 0.055
asset_cost 0.048
NEW.ACCTS.IN.LAST.SIX.MONTHS 0.022
manufacturer_id 0.013
Employment.Type 0.007
VoterID_flag 0.006
Aadhar_flag 0.005
Passport_flag 0.004
PAN_flag 0.003
Driving_flag 0.003
UniqueID 0.000
MobileNo_Avl_Flag 0.000
# bins = woebin(wrangled_train, y = 'loan_default')

data_woe = woebin_ply(wrangled_train, bins) %>% 
  as_tibble()
## [INFO] converting into woe values ...
data_woe
## # A tibble: 1,311 × 39
##    UniqueID disbursed_amount asset_cost   ltv branch_id supplier_id
##       <dbl>            <dbl>      <dbl> <dbl>     <dbl> <fct>      
##  1   576629            51578      68000  77.2         2 15899      
##  2   624008            36975      66166  60.4         1 17916      
##  3   564064            64482      73813  89.4        18 22708      
##  4   628786           116908     174438  69.6        85 23002      
##  5   545869            41381      71570  59.9        18 23483      
##  6   515616            72717     115240  65.0        11 23150      
##  7   631989            72299     101666  74.8         3 17038      
##  8   568869            59388      71500  85          67 16694      
##  9   644157            50278      74325  68.9        77 18238      
## 10   567106            65882      89311  76.1        36 23697      
##    manufacturer_id Current_pincode_ID Date.of.Birth Employment.Type
##    <fct>           <fct>              <date>        <chr>          
##  1 51              1720               1982-06-25    Salaried       
##  2 86              4938               1966-08-20    Self employed  
##  3 86              2697               1995-10-25    Salaried       
##  4 67              1935               1981-06-01    Self employed  
##  5 51              2721               1985-07-23    Self employed  
##  6 48              5956               1988-11-09    Self employed  
##  7 51              571                1997-01-01    Salaried       
##  8 86              1482               1960-04-15    Self employed  
##  9 120             2343               1983-12-22    Salaried       
## 10 120             6619               1976-01-01    Salaried       
##    DisbursalDate State_ID Employee_code_ID MobileNo_Avl_Flag Aadhar_flag
##    <date>        <fct>               <dbl> <fct>                   <dbl>
##  1 2018-10-13    4                    1635 1                           1
##  2 2018-10-26    3                    1101 1                           1
##  3 2018-10-09    4                    1709 1                           1
##  4 2018-10-27    4                     983 1                           1
##  5 2018-09-28    4                     451 1                           1
##  6 2018-09-18    3                       3 1                           1
##  7 2018-10-27    15                    624 1                           1
##  8 2018-10-11    6                     169 1                           1
##  9 2018-10-30    4                    2197 1                           1
## 10 2018-10-10    13                   1335 1                           0
##    PAN_flag VoterID_flag Driving_flag Passport_flag PERFORM_CNS.SCORE
##       <dbl> <fct>        <fct>        <fct>                     <dbl>
##  1        0 0            0            0                           783
##  2        0 0            0            0                           611
##  3        0 0            0            0                           821
##  4        1 0            0            0                            14
##  5        0 0            0            0                           738
##  6        0 0            0            0                            15
##  7        1 0            0            0                             0
##  8        0 0            0            0                           362
##  9        0 0            0            0                           587
## 10        0 1            0            0                             0
##    PERFORM_CNS.SCORE.DESCRIPTION                PRI.NO.OF.ACCTS PRI.ACTIVE.ACCTS
##    <chr>                                                  <dbl>            <dbl>
##  1 B-Very Low Risk                                            1                1
##  2 H-Medium Risk                                             24                5
##  3 A-Very Low Risk                                            4                2
##  4 Not Scored: Only a Guarantor                               1                0
##  5 C-Very Low Risk                                            7                4
##  6 Not Scored: Sufficient History Not Available               1                1
##  7 No Bureau History Available                                0                0
##  8 K-High Risk                                                3                3
##  9 I-Medium Risk                                              5                5
## 10 No Bureau History Available                                0                0
##    PRI.OVERDUE.ACCTS PRI.CURRENT.BALANCE PRI.SANCTIONED.AMOUNT
##                <dbl>               <dbl>                 <dbl>
##  1                 0              771804               1600000
##  2                 1              280921                425111
##  3                 0                2302                 39906
##  4                 0                   0                     0
##  5                 0               53462                 76536
##  6                 0               11076                 11076
##  7                 0                   0                     0
##  8                 1               66443                 77150
##  9                 2              709398                772628
## 10                 0                   0                     0
##    PRI.DISBURSED.AMOUNT PRIMARY.INSTAL.AMT NEW.ACCTS.IN.LAST.SIX.MONTHS
##                   <dbl>              <dbl>                        <dbl>
##  1              1600000              15707                            0
##  2               425111               4014                            0
##  3                 9163               1399                            0
##  4                    0                  0                            0
##  5                76536               8538                            1
##  6                11076               1496                            1
##  7                    0                  0                            0
##  8                77150               6884                            3
##  9               773389               8519                            1
## 10                    0                  0                            0
##    AVERAGE.ACCT.AGE CREDIT.HISTORY.LENGTH NO.OF_INQUIRIES loan_default
##    <chr>            <chr>                           <dbl> <fct>       
##  1 4yrs 7mon        4yrs 7mon                           0 0           
##  2 1yrs 3mon        3yrs 0mon                           0 0           
##  3 1yrs 1mon        2yrs 6mon                           0 0           
##  4 3yrs 10mon       3yrs 10mon                          0 0           
##  5 0yrs 11mon       1yrs 8mon                           1 1           
##  6 0yrs 0mon        0yrs 0mon                           1 1           
##  7 0yrs 0mon        0yrs 0mon                           0 1           
##  8 0yrs 0mon        0yrs 2mon                           0 1           
##  9 2yrs 3mon        5yrs 9mon                           0 0           
## 10 0yrs 0mon        0yrs 0mon                           0 1           
##    credit_category   AAA   CHL BorrowerAge NbrMonthRelation
##    <chr>           <dbl> <dbl>       <dbl>            <dbl>
##  1 Low                55    55          39             38.5
##  2 Medium             15    36          55             38.1
##  3 Low                13    30          26             38.6
##  4 Not Scored         46    46          40             38.0
##  5 Low                11    20          36             39  
##  6 Not Scored          0     0          33             39.3
##  7 Not Scored          0     0          24             38.0
##  8 High                0     2          61             38.6
##  9 Medium             27    69          37             37.9
## 10 Not Scored          0     0          45             38.6
##    DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS_woe
##                                      <dbl>
##  1                                  -0.173
##  2                                  -0.173
##  3                                  -0.173
##  4                                  -0.173
##  5                                  -0.173
##  6                                  -0.173
##  7                                  -0.173
##  8                                   1.09 
##  9                                   1.09 
## 10                                  -0.173
## # … with 1,301 more rows

4.3.4 Corrélations intra modèle (Étape optionnelle)

La matrice de corrélation obtenue avec le V de Cramer nous permet de connaitre la corrélation des différentes variables avec loan_default. Dans un premier temps, nous avons calculé la corrélation de toutes les variables présentes dans notre dataset. Après une première étude des résultats, nous avons décidés d’écarter certaines variables n’étant pas assez corrélées avec la variable loan_default. Dans un second temps, nous avons réalisé à nouveau un V de Cramer avec nos variables sélectionnées et nous obtenons de nouvelles valeurs de corrélation. Nous observons que 3 variables ont une bonne corrélation avec la variable loan_default, il s’agit des variables ltv_bin, NbrMonthRelation_bin, et PERFORM_CNS.SCORE.DESCRIPTION_bin.

Cependant la corrélation de la variable ltv avec loan_default_ ainsi qu’avec les autres variables semble être erroné malgré le sens de cette variable vis-à-vis de notre étude.

Nous pouvons ainsi réaliser notre régression logistique avec ces variables.

# 
# ALL_COLUMN_NAME <- wrangled_train %>% colnames()
# 
# ## Fonction de calcul d'un v de cramer
# fCramerFunction = function(x,y) {
#  tbl =  wrangled_train %>% select(x,y) %>% table()
#  cramV = round(cramer.v(tbl), 2)
#  data.frame(x, y, cramV) }
# 
# # create unique combinations of column names
# # sorting will help getting a better plot (upper triangular)
# df_comb = data.frame(t(combn(sort(c(ALL_COLUMN_NAME,"loan_default")), 2)), stringsAsFactors = F)
# 
# # apply function to each variable combination
# df_res = purrr::map2_df(df_comb$X1, df_comb$X2, fCramerFunction)
# 
# df_res
# 
# # plot results
# df_res %>%
#  ggplot(aes(x,y,fill=cramV))+
#  geom_tile()+
#  geom_text(aes(x,y,label=cramV))+
#  scale_fill_gradient(low="white", high="red")+
#  theme_classic()+ theme(axis.text.x = element_text(angle = 60, hjust = 1))


#############################################

SELECTED_COLUMNS <- c("State_ID",
  "PERFORM_CNS.SCORE.DESCRIPTION",
  "PAN_flag",
  "NbrMonthRelation",
  "ltv",
  "loan_default",
  "Employment.Type",
  "BorrowerAge",
  "Aadhar_flag")

small_df <- wrangled_train %>% select(
  SELECTED_COLUMNS
)
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(SELECTED_COLUMNS)` instead of `SELECTED_COLUMNS` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
fCramerFunction = function(x,y) {
 tbl =  small_df %>% select(x,y) %>% table()
 cramV = round(cramer.v(tbl), 2) 
 data.frame(x, y, cramV) }

df_comb = data.frame(t(combn(sort(c(SELECTED_COLUMNS,"loan_default")), 2)), stringsAsFactors = F)

# apply function to each variable combination
df_res = purrr::map2_df(df_comb$X1, df_comb$X2, fCramerFunction)
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(x)` instead of `x` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(y)` instead of `y` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
df_res
##                                x                             y cramV
## 1                    Aadhar_flag                   BorrowerAge  0.17
## 2                    Aadhar_flag               Employment.Type  0.05
## 3                    Aadhar_flag                  loan_default  0.03
## 4                    Aadhar_flag                  loan_default  0.03
## 5                    Aadhar_flag                           ltv  0.90
## 6                    Aadhar_flag              NbrMonthRelation  0.24
## 7                    Aadhar_flag                      PAN_flag  0.18
## 8                    Aadhar_flag PERFORM_CNS.SCORE.DESCRIPTION  0.12
## 9                    Aadhar_flag                      State_ID  0.58
## 10                   BorrowerAge               Employment.Type  0.56
## 11                   BorrowerAge                  loan_default  0.21
## 12                   BorrowerAge                  loan_default  0.21
## 13                   BorrowerAge                           ltv  0.87
## 14                   BorrowerAge              NbrMonthRelation  0.26
## 15                   BorrowerAge                      PAN_flag  0.21
## 16                   BorrowerAge PERFORM_CNS.SCORE.DESCRIPTION  0.20
## 17                   BorrowerAge                      State_ID  0.18
## 18               Employment.Type                  loan_default  0.04
## 19               Employment.Type                  loan_default  0.04
## 20               Employment.Type                           ltv  0.88
## 21               Employment.Type              NbrMonthRelation  0.24
## 22               Employment.Type                      PAN_flag  0.04
## 23               Employment.Type PERFORM_CNS.SCORE.DESCRIPTION  0.13
## 24               Employment.Type                      State_ID  0.26
## 25                  loan_default                  loan_default    NA
## 26                  loan_default                           ltv  0.90
## 27                  loan_default              NbrMonthRelation  0.25
## 28                  loan_default                      PAN_flag  0.03
## 29                  loan_default PERFORM_CNS.SCORE.DESCRIPTION  0.33
## 30                  loan_default                      State_ID  0.14
## 31                  loan_default                           ltv  0.90
## 32                  loan_default              NbrMonthRelation  0.25
## 33                  loan_default                      PAN_flag  0.03
## 34                  loan_default PERFORM_CNS.SCORE.DESCRIPTION  0.33
## 35                  loan_default                      State_ID  0.14
## 36                           ltv              NbrMonthRelation  0.88
## 37                           ltv                      PAN_flag  0.91
## 38                           ltv PERFORM_CNS.SCORE.DESCRIPTION  0.90
## 39                           ltv                      State_ID  0.90
## 40              NbrMonthRelation                      PAN_flag  0.24
## 41              NbrMonthRelation PERFORM_CNS.SCORE.DESCRIPTION  0.30
## 42              NbrMonthRelation                      State_ID  0.26
## 43                      PAN_flag PERFORM_CNS.SCORE.DESCRIPTION  0.12
## 44                      PAN_flag                      State_ID  0.34
## 45 PERFORM_CNS.SCORE.DESCRIPTION                      State_ID  0.12
# plot results
df_res %>%
 ggplot(aes(x,y,fill=cramV))+
 geom_tile()+
 geom_text(aes(x,y,label=cramV))+
 scale_fill_gradient(low="white", high="red")+
 theme_classic()+ theme(axis.text.x = element_text(angle = 60, hjust = 1))

4.4 Régression logistique

Enfin, nous avons sélectionné nos variables en calculant le pouvoir discriminant des variables grâce à l’information value ainsi que la corrélation des variables avec notre cible loan_default. En fonction des résultats obtenus, nous avons décidé de prendre 3 variables pour calculer notre régression, il s’agit des variables ltv, PERFORM_CNS.SCORE.DESCRIPTION et NbrMonthRelation. Au-delà du pouvoir discriminant et de la corrélation de ces variables, elles ont du sens vis-à-vis de notre cas d’étude, ce qui fait sens.

Pour notre régression logistique, nous obtenons ce graphique avec une AUC = 0.6959. C’est une valeur élevée ce qui signifie que notre modèle est performant.

feature_vars = c("ltv", "NbrMonthRelation", "PERFORM_CNS.SCORE.DESCRIPTION")

glm_formulation_vars <- reformulate(termlabels = feature_vars,
                              response="loan_default")


logit = glm(glm_formulation_vars,
            data=wrangled_train,
            family=binomial(link="logit"))

wrangled_test$Score = predict(logit,
                                  newdata = wrangled_test,
                                  type="response")

roc_obj = roc(wrangled_test$loan_default ~ wrangled_test$Score,
          plot=TRUE, percent=TRUE, auc=TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

roc_obj$auc
## Area under the curve: 69.59%
ggplot(wrangled_test, aes(x=Score, color=loan_default)) + geom_density()